Abstract

This paper presents a practical solution to the problem of high-fan-in, high-bandwidth
synchronized TCP workloads in datacenter Ethernets: the Incast problem. In these networks,
receivers often experience a drastic reduction in throughput when simultaneously requesting
data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP
timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a
synchronization requirement (e.g., filesystem reads and parallel data-intensive queries),
incast can reduce throughput by up to 90%.

Our solution for incast uses high-resolution timers in TCP to allow for
microsecond-granularity timeouts. We show that this technique is effective in avoiding incast
using simulation and real-world experiments. Last, we show that eliminating the minimum
retransmission timeout bound is safe for all environments, including the wide-area.
1 Introduction


[Figure 1 plot: Num Servers vs Goodput (Fixed Block = 1 MB, buffer = 64KB (est.), Switch = S50)]


In its 35 year history, TCP has been repeatedly challenged to adapt to new environments
and technology. Researchers have proved adroit in doing so, enabling TCP to function well
in gigabit networks [27], long/fat networks [18, 10], satellite and wireless environments
[22, 7], among others. In this paper, we examine and improve TCP's performance in an area
that, surprisingly, proves challenging to TCP: very low delay, high throughput, datacenter
networks of dozens to hundreds of machines.

The problem we study is termed incast: a drastic reduction in throughput when multiple
senders communicate with a single receiver in these networks. The highly bursty, very fast
data overfills typically small Ethernet switch buffers, causing intense packet loss that
leads to TCP timeouts. These timeouts last hundreds of milliseconds on a network whose
round-trip-time (RTT) is measured in the 10s or 100s of microseconds. Protocols that have
some form of synchronization requirement (filesystem reads, parallel data-intensive queries)
block waiting for the timed-out connections to finish. These timeouts and the resulting
delay can reduce throughput by 90% (Figure 1, 200ms RTOmin) or more [25, 28].

In this paper, we present and evaluate a set of system extensions to enable
microsecond-granularity timeouts for the TCP retransmission timeout (RTO). The challenges
in doing so are threefold: First, we show that the solution is practical by modifying the
Linux TCP implementation to use high-resolution kernel timers. Second, we show that these
modifications are effective, enabling a network testbed experiencing incast to maintain
maximum throughput for up to 47 concurrent senders, the testbed's maximum size (Figure 1,
200µs RTOmin). Microsecond granularity timeouts are necessary: simply reducing RTOmin to 1ms
without also improving the timing granularity may not prevent incast. In simulation, our
changes to TCP prevent incast for up to 2048 concurrent senders on 10 gigabit Ethernet.
Lastly, we show that the solution is safe, examining the effects of this aggressively
reduced RTO in the wide-area Internet, showing that its benefits to incast recovery have no
drawbacks on performance for bulk flows.

The motivation for solving this problem is the increasing interest in using Ethernet and
TCP for interprocessor communication and bulk storage transfer applications in the fastest,
largest data centers, instead of Fibrechannel or Infiniband. Provided that TCP adequately
supports high bandwidth, low latency, synchronized and parallel applications, there is a
strong desire to "wire-once" and reuse the mature, well-understood transport protocols that
are so familiar in lower bandwidth networks.

Figure 1: TCP Incast: A throughput collapse can occur with increased numbers of concurrent
senders in a synchronized request. Reducing RTO to microsecond granularity alleviates Incast.

2  Background 

Cost pressures increasingly drive datacenters to adopt commodity components, and often
low-cost implementations of such. An increasing number of clusters are being built with
off-the-shelf rackmount servers interconnected by Ethernet switches. While the adage "you
get what you pay for" still holds true, entry-level gigabit Ethernet switches today operate
at full data rates, switching upwards of 50 million packets per second, at a cost of about
$10 per port. Commodity 10Gbps Ethernet is now cost-competitive with specialized
interconnects such as Infiniband and FibreChannel, and also benefits from wide brand
recognition to boot. To reduce cost, however, switches often sacrifice expensive,
power-hungry SRAM packet buffers, the effect of which we explore throughout this work.

The desire for commodity parts extends to transport protocols. TCP provides a kitchen sink
of protocol features, giving reliability and retransmission, congestion and flow control,
and delivering packets in-order to the receiver. While not all applications need all of
these features [20, 31] or benefit from more rich transport abstractions [15], TCP is mature
and well-understood by developers, leaving it the transport protocol of choice even in many
high-performance environments.

Without link-level flow control, TCP is solely responsible for coping with and avoiding
packet loss in the (often small) Ethernet switch egress buffers. Unfortunately, the workload
we examine has three features that challenge (and nearly cripple) TCP's performance: a
highly parallel, synchronized request workload; buffers much smaller than the
bandwidth-delay product of the network; and low latency that results in TCP having windows
of only a few packets.

2.1  The  Incast  Problem 

Synchronized request workloads are becoming increasingly common in today's commodity
clusters. Examples include parallel reads/writes in cluster filesystems such as Lustre [8],
Panasas [34], or NFSv4.1 [33]; search queries sent to dozens of nodes, with results returned
to be sorted*; or parallel databases that harness multiple back-end nodes to process parts
of queries.

In a clustered file system, for example, a client application requests a data block striped
across several storage servers, issuing the next data block request only when all servers
have responded with their portion. This synchronized request workload can result in packets
overfilling the buffers on the client's port on the switch, resulting in many losses. Under
severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined
by the TCP minimum retransmission timeout (RTOmin). While operating systems use a default
value today that may suffice for the wide-area, datacenters and SANs have round-trip-times
that are orders of magnitude below the RTOmin defaults:

*In fact, engineers at Facebook recently rewrote the middle-tier caching software they use,
memcached [13], to use UDP so that they could "implement application-level flow control for
... gets of hundreds of keys in parallel" [12]


Scenario      RTT       OS        TCP RTOmin
WAN           100ms     Linux     200ms
Datacenter    <1ms      BSD       200ms
SAN           <0.1ms    Solaris   400ms

When a server involved in a synchronized request experiences a timeout, other servers can
finish sending their responses, but the client must wait a minimum of 200ms before receiving
the remaining parts of the response, during which the client's link may be completely idle.
The resulting throughput seen by the application may be as low as 1-10% of the client's
bandwidth capacity, and the per-request latency will be higher than 200ms.
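The arithmetic behind a figure in this range can be sketched as follows. This is a rough illustration, not a measurement: it assumes a 1 MB block (the size used later in the evaluation), a 1 Gbps client link, and exactly one 200ms stall per block. The function name goodput_mbps is hypothetical.

```python
def goodput_mbps(block_bytes, link_bps, stall_s):
    """Goodput in Mbps for one block when the link sits idle for stall_s seconds."""
    transfer_s = block_bytes * 8 / link_bps          # time to move the block at line rate
    return block_bytes * 8 / (transfer_s + stall_s) / 1e6

BLOCK = 1_000_000        # 1 MB block (decimal megabyte, an assumption)
LINK = 1_000_000_000     # 1 Gbps client link

no_timeout = goodput_mbps(BLOCK, LINK, 0.0)      # link never idles
one_timeout = goodput_mbps(BLOCK, LINK, 0.200)   # one 200ms RTOmin stall per block
fraction = one_timeout / no_timeout              # share of link capacity retained

print(f"{one_timeout:.1f} Mbps, {fraction:.1%} of capacity")
```

Under these assumptions the block itself needs only 8ms on the wire, so a single 200ms stall dominates the request time and leaves the application with a few percent of the link.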

This phenomenon was first termed "Incast" and described by Nagle et al. [25] in the context
of parallel filesystems. Nagle et al. coped with Incast in the parallel filesystem with
application-specific mechanisms. Specifically, Panasas [25] limits the number of servers
simultaneously sending to one client to about 10 by judicious choice of the file striping
policies. It also reduces the default size of its per-flow TCP receive buffers (capping the
advertised window size) on the client to avoid incast on switches with small buffers. For
switches with large buffers, Panasas provides a mount option to increase the client's
receive buffer size. In contrast, this work provides a TCP-level solution to incast for
switches with small buffers and many more than 10 simultaneous senders.

Without application-specific techniques, the general problem of incast remains: Figure 1
shows the throughput of our test synchronized-read application (Section 4) as we increase
the number of nodes it reads from, using an unmodified Linux TCP (200ms RTOmin line). This
application performs synchronized reads of 1MB blocks of data; that is, each of N servers
responds to a block read request with 1 MB / N bytes at the same time. Even using a
high-performance switch (with its default settings), the throughput drops drastically as the
number of servers increases, achieving a shockingly poor 3% of the network capacity (about
30Mbps) when it tries to stripe the blocks across all 47 servers.

Prior work characterizing TCP Incast ended on a somewhat down note, finding that existing
TCP improvements (NewReno, SACK [22], RED [14], ECN [30], Limited Transmit [3], and
modifications to Slow Start) sometimes increased throughput, but did not substantially
change the incast-induced throughput collapse [28]. This work found three partial solutions:
First, as shown in Figure 2 (from [28]), larger switch buffers can delay the onset of Incast
(doubling the buffer size doubled the number of servers that could be contacted). But
increased switch buffering comes at a substantial dollar cost: switches with 1MB packet
buffering per port may cost as much as $500,000. Second, Ethernet flow control was effective
when the machines were on a single switch, but was dangerous across inter-switch trunks
because of head-of-line blocking. Finally, reducing TCP's minimum RTO, in simulation,
appeared to allow nodes to maintain high throughput with several times as many nodes, but it
was left unexplored whether and to what degree these benefits could be achieved in practice,
meeting the three criteria we presented above: practicality, effectiveness, and safety. In
this paper, we answer these questions in depth to derive an effective solution for
practical, high-fan-in datacenter Ethernet communication.

[Figure 2 plot: Average Goodput vs Number of Servers (SRU = 256KB)]

Figure 2: Doubling the switch egress buffer size doubles the numbers of concurrent senders
needed to see incast.

3 Challenges to Fine-Grained TCP Timeouts

Successfully using an aggressive TCP retransmit timer requires first addressing the issue
of safety and generality: is an aggressive timeout appropriate for use in the wide-area, or
should it be limited to the datacenter? Does it risk increased congestion or decreased
throughput because of spurious (incorrect) timeouts? Second, it requires addressing
implementability: TCP implementations typically use a coarse-grained timer that provides
timeout support with very low overhead. Providing tighter TCP timeouts requires not only
reducing or eliminating RTOmin, but also supporting fine-grained RTT measurements and kernel
timers. Finally, if one makes TCP timeouts more fine-grained, how low must one go to achieve
high throughput with the smallest additional overhead? And to how many nodes does this
solution scale?

3.1  Jacobson  RTO  Estimation 

The standard RTO estimator [17] tracks a smoothed estimate of the round-trip time, and sets
the timeout to this RTT estimate plus four times the linear deviation; roughly speaking, a
value that lies outside four standard deviations from the mean:

RTO = SRTT + (4 × RTTVAR)    (1)

Two factors set lower bounds on the value that the RTO can achieve: an explicit
configuration parameter, RTOmin, and the implicit effects of the granularity with which RTTs
are measured and with which the kernel sets and checks timers. As noted earlier, common
values for RTOmin are 200ms, and most implementations track RTTs and timers at a granularity
of 1ms or larger.

Because RTT estimates are difficult to collect during loss and timeouts, a second safety
mechanism controls timeout behavior: exponential backoff. After each timeout, the RTO value
is doubled, helping to ensure that a single RTO set too low cannot cause a long-lasting
chain of retransmissions.
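The pieces described in this subsection (Equation 1, the RTOmin floor, and exponential backoff) can be combined in a short sketch. This is an illustration rather than any kernel's implementation: the class name RtoEstimator is hypothetical, and the smoothing gains of 1/8 and 1/4 and the initial RTTVAR of half the first sample are taken from the standard TCP retransmission timer specification, not from the text above.

```python
class RtoEstimator:
    """Sketch of a Jacobson-style RTO estimator with an RTOmin floor and backoff."""

    def __init__(self, rto_min=0.200, alpha=1 / 8, beta=1 / 4):
        self.rto_min = rto_min    # explicit lower bound on the RTO, in seconds
        self.alpha = alpha        # gain for the smoothed RTT (assumed standard value)
        self.beta = beta          # gain for the RTT variation (assumed standard value)
        self.srtt = None
        self.rttvar = None
        self.backoff = 1          # doubles after each timeout

    def on_rtt_sample(self, rtt):
        """Fold a new RTT measurement (seconds) into SRTT and RTTVAR."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        self.backoff = 1          # a fresh sample ends the backoff sequence

    def on_timeout(self):
        """Exponential backoff: double the timeout after each expiry."""
        self.backoff *= 2

    def rto(self):
        """RTO = SRTT + 4 * RTTVAR, floored at RTOmin, times the backoff factor."""
        base = self.srtt + 4 * self.rttvar
        return max(base, self.rto_min) * self.backoff


default = RtoEstimator(rto_min=0.200)   # common default floor
fine = RtoEstimator(rto_min=0.0)        # floor removed
for _ in range(20):
    default.on_rtt_sample(100e-6)       # steady 100 microsecond datacenter RTT
    fine.on_rtt_sample(100e-6)

print(default.rto(), fine.rto())
```

With a steady 100µs RTT, the floored estimator stays pinned at 200ms while the unfloored one settles close to the measured RTT, which is the gap the rest of the paper targets.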

3.2 Is it safe to disregard RTOmin?

There are two possible complications of permitting much smaller RTO values: spurious
(incorrect) timeouts when the network RTT suddenly jumps, and breaking the relationship
between the delayed acknowledgement timer and the RTO values.

Spurious retransmissions: The most prominent study of TCP retransmission showed that a high
(by the standards of datacenter RTTs) RTOmin helped avoid spurious retransmission in
wide-area TCP transfers [4], regardless of how good an estimator one used based on
historical RTT information. Intuition for why this is the case comes from prior [24, 11] and
subsequent [35] studies of Internet delay changes. While most of the time, end-to-end delay
can be modeled as random samples from some distribution (and therefore, can be predicted by
an RTO estimator), the delay consistently observes both occasional, unpredictable delay
spikes, as well as shifts in the distribution from which the delay is drawn. Such changes
can be due to the sudden introduction of cross-traffic, routing changes, or failures. As a
result, wide-area "packet delays [are] not mathematically [or] operationally steady" [35],
which confirms the Allman and Paxson observation that RTO estimation involves a fundamental
tradeoff between rapid retransmission and spurious retransmissions.

Fortunately, TCP timeouts and spurious timeouts were and remain rare events, as we explore
further in Section 8. In the ten years since the Allman and Paxson study, TCP variants that
more effectively recover from loss have been increasingly adopted [23]. By 2005, for
example, nearly 65% of Web servers supported SACK [22], which was introduced only in 1996.
In just three years from 2001 to 2004, the number of TCP Tahoe-based servers dropped
drastically in favor of NewReno-style servers.

Moreover, algorithms to undo the effects of spurious timeouts have been both proposed
[4, 21, 32] and, in the case of F-RTO, adopted in the latest Linux implementations. The
default F-RTO settings conservatively halve the congestion window when a spurious timeout is
detected but remain in congestion avoidance mode, thus avoiding the slow-start phase. Given
these improvements, we believe disregarding RTOmin is safer today than 10 years ago, and
Section 8 will show measurements reinforcing this.

Delayed Acknowledgements: The TCP delayed ACK mechanism attempts to reduce the amount of ACK
traffic by having a receiver acknowledge only every other packet [9]. If a single packet is
received with none following, the receiver will wait up to the delayed ACK timeout threshold
before sending an ACK. The delayed ACK threshold must be shorter than the lowest RTO values,
or a sender might time out waiting for an ACK that is merely delayed by the receiver. Modern
systems set the delayed ACK timeout to 40ms, with RTOmin set to 200ms.

Consequently, a host modified to reduce the RTOmin below 40ms would periodically experience
an unnecessary timeout when communicating with unmodified hosts, specifically when the RTT
is below 40ms (e.g., in the datacenter and for short flows on the wide-area). As we show in
the following section, delayed ACKs themselves impair performance in incast environments: we
suggest disabling them entirely. This solution requires client participation, however, and
so is not general. In the following section, we discuss three practical solutions for
interoperability with unmodified, delayed-ACK-enabled clients.

As a consequence, there are few a-priori reasons to believe that eliminating RTOmin (basing
timeouts solely upon the Jacobson estimator and exponential backoff) would greatly harm
wide-area performance and datacenter environments with clients using delayed ACK. We
evaluate these questions experimentally in Sections 4 and 8.


4 Evaluating Throughput with Fine-Grained RTO

How low must the RTO be allowed to go to retain high throughput under incast-producing
conditions, and to how many servers does this solution scale? We explore this question using
real-world measurements and ns-2 simulations [26], finding that to be maximally effective,
the timers must operate on a granularity close to the RTT of the network: hundreds of
microseconds or less.

Test application: Striped requests. The test client issues a request for a block of data
that is striped across N servers (the "stripe width"). Each server responds with
blocksize/N bytes of data. Only after it receives the full response from every server will
the client issue requests for the subsequent data block. This design mimics the request
patterns found in several cluster filesystems and several parallel workloads. Observe that
as the number of servers increases, the amount of data requested from each server decreases.
We run each experiment for 200 data block transfers to observe steady-state performance,
calculating the goodput (application throughput) over the entire duration of the transfer.

We select the block size (1MB) based upon read sizes common in several distributed
filesystems, such as GFS [16] and PanFS [34], which observe workloads that read on the order
of a few kilobytes to a few megabytes at a time. Prior work suggests that the block size
shifts the onset of incast (doubling the block size doubles the number of servers before
collapse), but does not substantially change the system's behavior; different systems have
their own "natural" block sizes. The mechanisms we develop improve throughput under incast
conditions for any choice of block sizes and buffer sizes.

In Simulation: We simulate one client and multiple servers connected through a single switch
where round-trip-times under low load are 100µs. Each node has 1Gbps capacity, and we
configure the switch buffers with 32KB of output buffer space per port, a size chosen based
on statistics from commodity 1Gbps Ethernet switches. Because ns-2 is an event-based
simulation, the timer granularity is infinite, hence we investigate the effect of RTOmin to
understand how low the RTO needs to be to avoid Incast throughput collapse. Additionally, we
add a small random timer scheduling delay of up to 20µs to more accurately model real-world
scheduling variance.^

Figure 3 depicts throughput as a function of the RTOmin for stripe widths between 4 and 128
servers. Throughput using the default 200ms RTOmin drops by nearly an order of magnitude
with 8 concurrent senders, and by nearly two orders of magnitude when data is striped across
64 and 128 servers.

Reducing the RTOmin to 1ms is effective for 8-16 concurrent senders, fully utilizing the
client's link, but begins to suffer when data is striped across a larger number of servers:
128 concurrent senders utilize only 50% of the available link bandwidth even with a 1ms
RTOmin. For 64 and 128 servers and low RTOmin values, we note that each individual flow does
not have enough data to send to saturate the link, but also that performance for 200µs is
worse than for 1ms; we address this issue in more detail when scaling to hundreds of
concurrent senders in Section 7.

^Experience with ns-2 showed that without introducing this delay, we saw simultaneous
retransmissions from many servers all within a few microseconds, which is rare in real world
settings.

[Figure 3 plot: RTOmin vs Goodput (Block size = 1 MB, buffer = 32KB)]

Figure 3: Reducing the RTOmin in simulation to microseconds from the current default value
of 200ms improves goodput.

[Figure 4 plot: RTOmin vs Goodput (Fixed Block = 1MB, buffer = 32KB (estimate))]

Figure 4: Experiments on a real cluster validate the simulation result that reducing the
RTOmin to microseconds improves goodput.

In Real Clusters: We evaluate incast on two clusters; one sixteen-node cluster using an HP
Procurve 2848 switch, and one 48-node cluster using a Force10 S50 switch. In these clusters,
every node has 1 Gbps links and a client-to-server RTT of approximately 100µs. All nodes run
Linux kernel 2.6.28. We run the same synchronized read workload as in simulation.

For these experiments, we use our modified Linux 2.6.28 kernel that uses
microsecond-accurate timers with microsecond-granularity RTT estimation (§6) to be able to
accurately set the RTOmin to a desired value. Without these modifications, the default TCP
timer granularity in Linux can be reduced only to 1ms. As we show later, when added to the
4x RTTVAR estimate, the 1ms timer granularity effectively raises the minimum RTO over 5ms.
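One simplified way to see how a floor of roughly this size can arise is sketched below. It is an illustrative model, not the kernel's actual arithmetic: it assumes SRTT and RTTVAR are each tracked in whole timer ticks and cannot fall below one tick. The function quantized_rto_us and the sample values (100µs SRTT, 25µs RTTVAR) are hypothetical.

```python
def quantized_rto_us(srtt_us, rttvar_us, tick_us):
    """RTO in microseconds when SRTT and RTTVAR are rounded up to whole ticks.

    Assumed model: each quantity occupies at least one tick, so the smallest
    possible RTO is (1 + 4 * 1) ticks regardless of the true RTT.
    """
    srtt_ticks = max(1, -(-srtt_us // tick_us))       # ceiling division
    rttvar_ticks = max(1, -(-rttvar_us // tick_us))
    return (srtt_ticks + 4 * rttvar_ticks) * tick_us


coarse = quantized_rto_us(100, 25, tick_us=1000)   # 1ms timer granularity
fine = quantized_rto_us(100, 25, tick_us=1)        # 1 microsecond granularity

print(coarse, fine)
```

In this model the 1ms tick alone holds the RTO at about 5ms even though the path RTT is 100µs, while microsecond ticks let Equation 1 produce a value of a few hundred microseconds.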

Figure 4 plots the application goodput as a function of the RTOmin for 4, 8, and 16
concurrent senders. For all configurations, goodput drops with increasing RTOmin above 1ms.
For 8 and 16 concurrent senders, the default RTOmin of 200ms results in nearly 2 orders of
magnitude drop in throughput.

The real world results deviate from the simulation results in a few minor ways. First, the
maximum achieved throughput in simulation nears 1Gbps, whereas the maximum achieved in the
real world is 900Mbps. Simulation throughput is always higher because simulated nodes are
infinitely fast, whereas real-world nodes are subject to myriad influences, including OS
scheduling and Ethernet or switch timing differences, resulting in real-world results
slightly below that of simulation.

Second, real world results show negligible difference between 8 and 16 servers, while the
differences are more pronounced in simulation. We attribute this to variances in the
buffering between simulation and the real world. As we show in Section 7, small microsecond
differences in retransmission scheduling can lead to improved goodput; these differences
exist in the real world but are not modeled as accurately in simulation.

Third, the real world results show identical performance for RTOmin values of 200µs and 1ms,
whereas there are slight differences in simulation. We find that the RTT and variance seen
in the real world is higher than that seen in simulation. Figure 5 shows the distribution of
round-trip-times during an Incast workload. While the baseline RTTs can be between 50-100µs,
increased congestion causes RTTs to rise to 400µs on average with spikes as high as 1ms.
Hence, the variance of the RTO combined with the higher RTTs mean that the actual
retransmission timers set by the kernel are between 1-3ms, where an RTOmin will show no
improvement below 1ms. Hence, where we specify a RTOmin of 200µs, we are effectively
eliminating the RTOmin, allowing RTO to be as low as calculated.

[Figure 5 plot: RTT Distribution in SAN (x-axis: RTT in Microseconds)]

Figure 5: During an incast experiment on a 16-node cluster, RTTs increase by 4 times the
baseline RTT (100µs) on average with spikes as high as 1ms. This produces RTO values in the
range of 1-3ms, resulting in an RTOmin of 1ms being as effective as 200µs in practice.

Despite these differences, the real world results show the need to reduce the RTO to at
least 1ms to avoid throughput degradation.

4.1 Interaction with Delayed ACK for Unmodified Clients

During the onset of Incast using five or fewer servers, the delayed ACK mechanism acted as a
miniature timeout, resulting in reduced, but not catastrophically low, throughput [28]
during certain loss patterns.

As shown to the right, delayed ACK can delay the receipt of enough duplicate acks for
data-driven loss recovery when the window is small (or reduce the number of duplicate ACKs
by one, turning a recoverable loss into a timeout). While this delay is not as high as a
full 200ms RTO, 40ms is still large compared to the RTT and results in low throughput for
three to five concurrent senders. Beyond five senders, high packet loss creates 200ms
retransmission timeouts which mask the impact of delayed ACK delays.

While servers require modifications to the TCP stack to enable microsecond-resolution
retransmission timeouts, the clients issuing the requests do not necessarily need to be
modified. But because delayed ACK is implemented at the receiver, a server may normally wait
40ms for an unacked segment that successfully arrived at the receiver. For servers using a
reduced RTO in a datacenter environment, the server's retransmission timer may expire long
before the unmodified client's 40ms delayed ACK timer fires. As a result, the server will
time out and resend the unacked packet, cutting ssthresh in half and rediscovering link
capacity using slow-start. Because the client acknowledges the retransmitted segment
immediately, the server does not observe a coarse-grained 40ms delay, only an unnecessary
timeout.
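The race between the server's retransmission timer and the client's delayed ACK timer can be sketched as a small timeline comparison. This is a hypothetical model for illustration only: it considers a single delivered segment, ignores processing time, and the function name lone_segment_outcome is not from the paper.

```python
def lone_segment_outcome(rto_s, rtt_s, delayed_ack_s=0.040):
    """Return (event, time the sender first sees an ACK) for one delivered segment.

    Assumed model: the receiver holds its ACK for delayed_ack_s, so the ACK
    reaches the sender at rtt_s + delayed_ack_s unless the sender's
    retransmission timer (rto_s) fires first.
    """
    delayed_ack_arrival = rtt_s + delayed_ack_s
    if rto_s < delayed_ack_arrival:
        # Timer fires first: the segment is resent and acknowledged immediately,
        # so the sender hears back about one RTT after the timeout.
        return "unnecessary timeout", min(rto_s + rtt_s, delayed_ack_arrival)
    return "delayed ack", delayed_ack_arrival


fine_grained = lone_segment_outcome(rto_s=0.001, rtt_s=0.0001)   # 1ms RTO
default_rto = lone_segment_outcome(rto_s=0.200, rtt_s=0.0001)    # 200ms RTOmin

print(fine_grained, default_rto)
```

With a 1ms RTO the sender times out and hears back after roughly 1.1ms, while with the 200ms floor it simply waits about 40ms for the delayed ACK, matching the two behaviours contrasted in the paragraph above.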

Figure 6 shows the performance difference between our modified client with delayed ACK
disabled, delayed ACK enabled with a 200µs timer, and a standard kernel with delayed ACK
enabled.

Beyond 8 servers, our modified client with delayed ACK enabled receives 15-30Mbps lower
throughput compared to the client with delayed ACK disabled, whereas the standard client
experiences between 100 and 200Mbps lower throughput. When the client delays an ACK, the
standard client forces the servers to time out, yielding much worse performance. In
contrast, the 200µs minimum delayed ACK timeout client delays a server by roughly a
round-trip-time and does not force the server to time out, so the performance hit is much
smaller.

Delayed ACK can provide benefits where the ACK path is congested [6], but in the datacenter
environment, we believe that delayed ACK should be disabled; most high-performance
applications favor quick response over an additional ACK-processing overhead and are
typically equally provisioned for both directions of traffic. Our evaluations in Section 6
disable delayed ACK on the client for this reason. While these results show that for full
performance, delayed ACK should be disabled, we note that unmodified clients still achieve
good performance and avoid Incast.


[Figure 6 plot: Num Servers vs Goodput (DelayedACK Client) (Fixed Block = 1 MB, buffer = 32KB (est.), Switch = Procurve)]

Figure 6: Disabling Delayed ACK on client nodes provides optimal goodput.

5  Preventing  Incast 

In this section we briefly outline the three components of our solution to incast, the
necessity for and effectiveness of which we evaluate in the following section.

1. Microsecond-granularity timeouts. We first modify the kernel to measure RTTs in
microseconds, changing both the infrastructure used for timing and the values placed in the
TCP timestamp option. The kernel changes use the high-resolution, hardware-supported timing
facilities present in modern operating systems. In Section 9, we briefly discuss
alternatives for legacy hardware.

2. Randomized timeout delay. Second, we modify the microsecond-granularity RTO values to
include an additional randomized component:

timeout = RTO + (rand(0.5) × RTO)

In expectation, this increases the duration of the
RTO by up to 25%, but more importantly, has the effect
of de-synchronizing packet retransmissions. As
we show in the next section, this decision is unimportant
with smaller numbers of servers, but becomes
critical when the number of servers is large enough
that a single batch of retransmitted packets itself
causes extensive congestion. The extra half-RTO
allows some packets to get through the switch
(the RTO is at least one RTT, so half an RTO is
at least the one-way transit time) before being clobbered
by the retransmissions.
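The randomized timeout can be sketched in a few lines. This is only an illustration, not the kernel change itself: it assumes rand(0.5) denotes a uniform draw from [0, 0.5], and the function name `randomized_timeout` and the microsecond units are hypothetical choices made for the example.

```python
import random
from typing import Optional


def randomized_timeout(rto_us: float, rng: Optional[random.Random] = None) -> float:
    """Return RTO plus a random extra delay of between 0 and half an RTO.

    Assumption: rand(0.5) in the text is a uniform draw from [0, 0.5].
    """
    rng = rng if rng is not None else random.Random()
    jitter_fraction = rng.uniform(0.0, 0.5)
    return rto_us + jitter_fraction * rto_us


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so repeated runs print the same values
    for _ in range(3):
        print(round(randomized_timeout(200.0, rng), 1))
```

Under this uniform assumption every timeout falls between 1.0 and 1.5 times the RTO, with a mean of 1.25 times the RTO.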


3. Disabling delayed ACKs. As noted earlier, delayed
ACKs can have a poor interaction with microsecond
timeouts: the sender may timeout on the
last packet in a request, retransmitting it while waiting
for a delayed ACK from the receiver. In the
datacenter environment, the reduced window size is
relatively unimportant: maximum window sizes are
small already and the RTT is short. This retransmission
does, however, create overhead when requests
are small, reducing throughput by roughly 10%.

While the focus of this paper is on understanding
the limits of providing high throughput in datacenter
networks, it is important to ensure that these solutions
can be adopted incrementally. Towards this
end, we discuss briefly three mechanisms for interacting
with legacy clients:

•  Bimodal timeout operation: The kernel could
use microsecond RTTs only for extremely low
RTT connections, where the retransmission
overhead is much lower than the cost of coarse-grained
timeouts, but still use coarse timeouts
in the wide-area.

•  Make aggressive timeouts a socket option: Require
users to explicitly enable aggressive timeouts
for their application. We believe this is an
appropriate option for the first stages of deployment,
while clients are being upgraded.

•  Disable delayed ACKs, or use an adaptive delayed
ACK timer. While the former has numerous
advantages, fixing the delay penalty of
delayed ACK for the datacenter is relatively
straightforward: base the delayed ACK timeout
on the smoothed inter-packet arrival rate instead
of having a static timeout value. We have not
implemented this option, but as Figure 6 shows,
a static 200µs timeout value (still larger than the
inter-packet arrival time) restores throughput to about
98% of the no-delayed-ACK throughput.
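The adaptive delayed ACK timer suggested in the last item was not implemented in this work, so the following is only a speculative sketch of one way it might look. The class name `AdaptiveDelayedAck`, the smoothing gain `alpha`, the `multiplier`, and the floor and ceiling values are all assumptions introduced for illustration; none of them come from the paper or from any kernel.

```python
from typing import Optional


class AdaptiveDelayedAck:
    """Delayed-ACK timeout derived from a smoothed inter-packet arrival gap."""

    def __init__(self, alpha: float = 0.125, multiplier: float = 2.0,
                 floor_us: float = 1.0, ceiling_us: float = 40_000.0) -> None:
        self.alpha = alpha            # EWMA gain (assumed value)
        self.multiplier = multiplier  # wait this many smoothed gaps (assumed)
        self.floor_us = floor_us
        self.ceiling_us = ceiling_us  # static fallback before any gap is seen
        self.smoothed_gap_us: Optional[float] = None
        self.last_arrival_us: Optional[float] = None

    def on_packet(self, now_us: float) -> None:
        """Update the smoothed inter-arrival gap with a new packet arrival."""
        if self.last_arrival_us is not None:
            gap = now_us - self.last_arrival_us
            if self.smoothed_gap_us is None:
                self.smoothed_gap_us = gap
            else:
                self.smoothed_gap_us += self.alpha * (gap - self.smoothed_gap_us)
        self.last_arrival_us = now_us

    def timeout_us(self) -> float:
        """Current delayed-ACK timeout, clamped to [floor_us, ceiling_us]."""
        if self.smoothed_gap_us is None:
            return self.ceiling_us
        candidate = self.multiplier * self.smoothed_gap_us
        return min(self.ceiling_us, max(self.floor_us, candidate))


if __name__ == "__main__":
    timer = AdaptiveDelayedAck()
    for i in range(50):
        timer.on_packet(i * 12.0)  # packets arriving every 12 microseconds
    print(timer.timeout_us())
```

The point of the design is that the timeout tracks the observed packet spacing, so on a fast datacenter link it shrinks to tens of microseconds while on a slow path it stays near the static value.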


6 Achieving Microsecond-granularity Timeouts

The TCP clock granularity in most popular operating
systems is on the order of milliseconds, as defined
by the “jiffy” clock, a global counter updated
by the kernel at a frequency “HZ”, where HZ is typically
100, 250, or 1000. Linux, for example, updates
the jiffy timer 250 times a second, yielding a TCP
clock granularity of 4ms, with a configuration option
to update 1000 times per second for a 1ms granularity.
More frequent updates, as would be needed to
achieve finer granularity timeouts, would impose an
increased, system-wide kernel interrupt overhead.

Unfortunately, setting the RTOmin to 1 jiffy (the
lowest possible value) does not achieve RTO values
of 1ms because of the clock granularity. TCP measures
RTTs in 1ms granularity at best, so both the
smoothed RTT estimate and RTT variance have a 1
jiffy (1ms) lower bound. Since the standard RTO
estimator sums the RTT estimate with 4x the RTT
variance, the lowest possible RTO value is 5 jiffies.
We experimentally validated this result by setting the
clock granularity to 1ms, setting RTOmin to 1ms, and
observing that TCP timeouts were a minimum of
5ms.
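The 5-jiffy floor follows directly from the estimator arithmetic described above. The sketch below only restates that arithmetic; the function names are hypothetical, and it assumes that both the smoothed RTT and the RTT variance bottom out at exactly one clock tick.

```python
def min_rto_jiffies(granularity_jiffies: int = 1, variance_weight: int = 4) -> int:
    """Smallest RTO when SRTT and RTTVAR are each floored at one clock tick."""
    srtt_floor = granularity_jiffies
    rttvar_floor = granularity_jiffies
    return srtt_floor + variance_weight * rttvar_floor


def effective_rto_ms(rto_min_ms: float, hz: int) -> float:
    """Lowest RTO actually reachable for a configured RTOmin and jiffy rate HZ."""
    jiffy_ms = 1000.0 / hz
    estimator_floor_ms = min_rto_jiffies() * jiffy_ms
    return max(rto_min_ms, estimator_floor_ms)


if __name__ == "__main__":
    print(min_rto_jiffies())            # 1 + 4 * 1 jiffies
    print(effective_rto_ms(1.0, 1000))  # 1ms RTOmin on a 1ms clock
    print(effective_rto_ms(1.0, 250))   # 1ms RTOmin on a 4ms clock
```

With HZ = 1000 and RTOmin = 1ms the effective floor is 5ms, matching the experimental observation; with the default HZ = 250 the same reasoning gives 20ms.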

At a minimum possible RTOmin of 5ms in standard
TCP implementations, throughput loss cannot
be avoided for as few as 8 concurrent senders.
While the results of Figures 3 and 4 suggest reducing
the RTOmin to 5ms can be both simple (a one-line
change) and helpful, next we describe how to
achieve microsecond granularity RTO values in the
real world.

6.1  Linux  high-resolution  timers:  hrtimers 

High resolution timers were introduced in Linux kernel
version 2.6.18 and are still actively in development.
They form the basis of the posix-timer
and itimer user-level timers, nanosleep, and a few
other in-kernel operations, including the update of
the jiffies value.

The Generic Time of Day (GTOD) framework
provides the kernel and other applications with
nanosecond resolution timing using the CPU cycle
counter on modern processors. The hrtimer implementation
interfaces with the High Precision Event
Timer (HPET) hardware also available on modern
systems to achieve microsecond resolution in the kernel.
When a timer is added to the list, the kernel
checks whether this is the next expiring timer,
programming the HPET to send an interrupt when the
HPET’s internal clock advances by a desired amount.
For example, the kernel may schedule a timer to expire
once every 1ms to update the jiffy counter, and
the kernel will be interrupted by the HPET to update
the jiffy timer only every 1ms.
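The "program the hardware only for the next expiring timer" behavior can be illustrated with a toy timer queue. This is a simplified model rather than the hrtimer code: it uses a binary heap from Python's standard library where the kernel uses a red-black tree, and the class name `TimerQueue`, its methods, and the `programmed_expiry_us` field standing in for the HPET comparator are all hypothetical.

```python
import heapq
from typing import List, Optional, Tuple


class TimerQueue:
    """Toy one-shot timer queue: hardware is armed only for the earliest expiry."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []  # (expiry_us, seq, name)
        self._seq = 0
        self.programmed_expiry_us: Optional[int] = None  # stand-in for the HPET

    def add(self, expiry_us: int, name: str) -> bool:
        """Queue a timer; return True if the hardware had to be reprogrammed."""
        heapq.heappush(self._heap, (expiry_us, self._seq, name))
        self._seq += 1
        if self.programmed_expiry_us is None or expiry_us < self.programmed_expiry_us:
            self.programmed_expiry_us = expiry_us  # new timer expires first
            return True
        return False

    def fire(self, now_us: int) -> List[str]:
        """Handle an interrupt at now_us: run expired timers, re-arm for the next."""
        expired = []
        while self._heap and self._heap[0][0] <= now_us:
            expired.append(heapq.heappop(self._heap)[2])
        self.programmed_expiry_us = self._heap[0][0] if self._heap else None
        return expired


if __name__ == "__main__":
    q = TimerQueue()
    print(q.add(1000, "jiffy update"))       # first timer arms the hardware
    print(q.add(5000, "other timer"))        # later expiry, no reprogramming
    print(q.add(200, "TCP retransmission"))  # earlier expiry, hardware re-armed
    print(q.fire(250), q.programmed_expiry_us)
```

Because an interrupt is requested only for the earliest pending expiry, a microsecond-scale TCP timer adds interrupts only when it is actually the next event, instead of raising the system-wide tick rate.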

Our preliminary evaluations of hrtimer overhead
have shown no appreciable overhead of implementing
TCP timeouts using the hrtimer subsystem. We
posit that, at this stage in development, only a few
kernel functions use hrtimers, so the red-black tree
that holds the list of timers may not contain enough
timers to see poor performance as a result of repeated
insertions and deletions. Also, we argue that for incast,
where high-resolution timers for TCP are required,
any introduced overhead may be acceptable,
as it removes the idle periods that prevent the server
from doing useful work to begin with.

6.2  Modifications  to  the  TCP  Stack 

The Linux TCP implementation requires three
changes to support microsecond timeouts using
hrtimers: microsecond resolution time accounting
to track RTTs with greater precision, redefinition of
TCP constants, and replacement of low-resolution
timers with hrtimers.


[Figure 7 plot: "Num Servers vs Goodput" (Fixed Block = 1 MB, buffer = 32KB (est.), Switch = Procurve). x-axis: Number of Servers. Series: 200us RTOmin, 1ms RTOmin Jiffy, 200ms RTOmin.]

Figure 7: On a 16 node cluster, our high-resolution
1ms RTOmin eliminates incast. The jiffy-based implementation
has a 5ms lower bound on RTO, and
achieves only 65% throughput.


Microsecond accounting: By default, the jiffy
counter is used for tracking time. To provide microsecond
granularity accounting, we use the GTOD
framework to access the 64-bit nanosecond resolution
hardware clock wherever the jiffies time is traditionally
used.

With the TCP timestamp option enabled, RTT estimates
are calculated based on the difference between
the timestamp option in an earlier packet and
the corresponding ACK. We convert the time from
nanoseconds to microseconds and store the value in
the TCP timestamp option^. This change can be accomplished
entirely on the sender; receivers simply
echo back the value in the TCP timestamp option.

Constants, Timers and Socket Structures: All
timer constants previously defined with respect to the
jiffy timer are converted to exact microsecond values.
The TCP implementation must make use of the
hrtimer interface: we replace the standard timer objects
in the socket structure with the hrtimer structure,
ensuring that all subsequent calls to set, reset, or
clear these timers use the appropriate hrtimer functions.

^The lower wrap-around time (2^32 microseconds, or 4294
seconds) is still far greater than the maximum IP segment lifetime
(120-255 seconds).
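The timestamp conversion and the wrap-around figure in the footnote can be checked with a short sketch. It assumes a 32-bit timestamp field and uses hypothetical helper names (`ns_to_ts_us`, `rtt_us`); it models only the arithmetic, not the kernel data path.

```python
TS_MODULUS = 2 ** 32  # the timestamp value is a 32-bit field


def ns_to_ts_us(now_ns: int) -> int:
    """Convert a nanosecond clock reading to a 32-bit microsecond timestamp."""
    return (now_ns // 1000) % TS_MODULUS


def rtt_us(echoed_ts_us: int, now_ts_us: int) -> int:
    """RTT from an echoed timestamp, correct across a single wrap-around."""
    return (now_ts_us - echoed_ts_us) % TS_MODULUS


# Time until a microsecond-resolution 32-bit timestamp wraps, in seconds.
WRAP_SECONDS = TS_MODULUS / 1_000_000

if __name__ == "__main__":
    print(int(WRAP_SECONDS))               # whole seconds before wrap-around
    sent = TS_MODULUS - 10                 # stamped just before the counter wraps
    received = 90                          # echoed back just after the wrap
    print(rtt_us(sent, received))          # modular subtraction recovers the RTT
```

Since only the sender interprets the echoed value, the change of units needs no cooperation from the receiver, which is why the modification can be made on the sender alone.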


6.3  hrtimer  Results 

Figure 7 presents the achieved goodput as we increase
the stripe width N using various RTOmin values
on a Procurve 2848 switch. As before, the client
issues requests for 1MB data blocks striped over
N servers, issuing the next request once the previous
data block has been received. Using the default
200ms RTOmin, throughput plummets beyond
8 concurrent senders. For a 1ms jiffy-based RTOmin,
throughput begins to drop at 8 servers to about 70%
of link capacity and slowly decreases thereafter; as
shown previously, the effective RTOmin is 5ms. Last,
our TCP hrtimer implementation allowing microsecond
RTO values achieves the maximum achievable
goodput for 16 concurrent senders.
[Figure 8 plot: "Num Servers vs Goodput" (Fixed Block = 1 MB, buffer = 64KB (est.), Switch = S50). Series: 200us RTOmin, 1ms RTOmin Jiffy, 200ms RTOmin (default).]

Figure 8: For a 48-node cluster, providing RTO values
in microseconds eliminates incast.


[Figure 9 plot: "Num Servers vs Goodput" (Fixed Block = 1 MB, buffer = 512KB (est.), Switch = S50 with QoS Disabled). x-axis: Number of Servers. Series: 1ms RTOmin Jiffy, 200ms RTOmin (default).]

Figure 9: Switches configured with large enough
buffer capacity can delay incast.

We verify these results on a second cluster consisting
of 1 client and 47 servers connected to a
single 48-port Force10 S50 switch (Figure 8). The
microsecond RTO kernel is again able to saturate
throughput for 47 servers, the testbed’s maximum
size. The 1ms RTOmin jiffy-based configuration obtained
70-80% throughput, with an observable drop
above 40 concurrent senders.

When the Force10 S50 switch is configured to disable
multiple QoS queues, the per-port packet buffer
allocation is large enough that incast can be avoided
for up to 47 servers (Figure 9). This reaffirms the
simulation result of Figure 2: one way to avoid
incast in practice is to use larger per-port buffers in
switches on the path. It also emphasizes that relying
on switch-based incast solutions involves more
than just the switch's total buffer size: switch configuration
options designed for other workloads can
make its flows more prone to incast. A generalized
TCP solution should reduce administrative complexities
in the field.

[Figure 10 plot: "RTT Distribution at Los Alamos National Lab Storage Node". x-axis: RTT in Microseconds (binsize = 10us).]

Figure 10: Distribution of RTTs shows an appreciable
number of RTTs in the 10s of microseconds.

Overall, we find that enabling microsecond RTO
values in TCP successfully avoids the incast throughput
drop in two real-world clusters for as many as 47
concurrent servers, the maximum available to us to
date, and that microsecond resolution is necessary
to achieve high performance with some switches or
switch configurations.

7  Next-generation  Datacenters 

TCP Incast poses further problems for the next generation
of datacenters consisting of 10Gbps networks
and hundreds to thousands of machines. These networks
have very low-latency capabilities to keep
competitive with alternative high-performance interconnects
like Infiniband; though TCP kernel-to-kernel
latency may be as high as 40-80µs due to kernel
scheduling, port-to-port latency can be as low
as 10µs. Because 10Gbps Ethernet provides higher
bandwidth, servers will be able to send their portion
of a data block faster, requiring that RTO values be
shorter to avoid idle link time. For example, we plot
the distribution of RTTs seen at a storage node at Los
Alamos National Lab (LANL) in Figure 10: 20% of
RTTs are below 100µs, showing that networks today
are capable of, and operate in, very low-latency environments,
further motivating the need for microsecond-granularity
RTO values.

[Figure 11 plot: "Average Goodput VS # Servers" (Block size = 40MB, buffer = 32KB, rtt = 20us). x-axis: Number of Servers.]

Figure 11: In simulation, flows still experience reduced
goodput even with microsecond resolution
timeouts (RTOmin = RTT = 20µs) without a randomized
RTO component.

[Figure 12 plot: "Repeated Retransmissions, Backoff and Idle-time". x-axis: time (seconds). Series: Instantaneous Link Utilization, Flow 503 Failed Retransmission, Flow 503 Successful Retransmission.]

Figure 12: Some flows experience repeated retransmission
failures due to synchronized retransmission
behavior, delaying transmission far beyond when the
link is idle.

Scaling to thousands of concurrent senders: In
this section, we analyze the impact of incast and the
reduced RTO solution for 10Gbps Ethernet networks
in simulation as we scale the number of concurrent
senders into the thousands. We reduce baseline RTTs
from 100µs to 20µs, and increase link capacity to
10Gbps, keeping per-port buffer size at 32KB as we
assume smaller buffers for faster switches.

We increase the blocksize to 40MB (to ensure each
flow can mostly saturate the 10Gbps link), scale up
the number of nodes from 32 to 2048, and reduce the
RTOmin to 20µs, effectively eliminating a minimum
bound. In Figure 11, we find that without a randomized
RTO component, goodput decreases sublinearly
(note the log-scale x-axis) as the number of nodes
increases, indicating that even with an aggressively
low RTO granularity, we still observe reduced goodput.

Reduced goodput can arise due to idle link time
or due to retransmissions; retransmissions factor into
throughput but not goodput. Figure 11 shows that for
a small number of flows, throughput is near optimal
but goodput is lower, sometimes by up to 15%. For
a larger number of flows, however, throughput and
goodput are nearly identical: with an aggressively
low RTO, there exist periods where the link is idle
for a large number of concurrent senders.


These idle periods occur specifically when there
are many more flows than the switch's buffer capacity
can absorb, due to simultaneous, successive
timeouts. Recall that after every timeout, the RTO
value is doubled until an ACK is received. This has
been historically safe because TCP quickly and conservatively
estimates the duration to wait until congestion
abates. However, the exponentially increasing
delay can overshoot some portion of time that
the link is actually idle, leading to sub-optimal goodput.
Because only one flow must overshoot to delay
the entire transfer, the probability of overshooting
increases with increased number of flows.
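The last claim can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each flow overshoots independently with the same probability p; real flows sharing a switch are correlated, so the numbers indicate the trend rather than a prediction. The function name is hypothetical.

```python
def prob_any_overshoot(p_single: float, n_flows: int) -> float:
    """Probability that at least one of n independent flows overshoots."""
    return 1.0 - (1.0 - p_single) ** n_flows


if __name__ == "__main__":
    # Illustrative per-flow probability of 1%; not a measured value.
    for n in (1, 32, 256, 1024, 2048):
        print(n, round(prob_any_overshoot(0.01, n), 4))
```

Even a small per-flow probability approaches certainty once the block is striped over hundreds of senders, because the transfer completes only when the slowest flow does.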

Figure 12 shows the instantaneous link utilization
for all flows and the retransmission events for one
of the flows that experienced repeated retransmission
failures during an incast simulation on a 1Gbps network.
This flow timed out and retransmitted a packet
at the same time that other timed out flows also retransmitted.
While some of these flows got through
and saturated the link for a brief period of time, the
flow shown here timed out and doubled its timeout
value (until the maximum factor of 64 * RTO) for
each failed retransmission. The link then became
available soon after the retransmission event, but the
RTO backoff set the retransmission timer to fire far
beyond this time. When this packet eventually got
through, the block transfer completed and the next
block transfer began, but only after large periods of
link idle time that reduced goodput.
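The overshoot can be reproduced with a small model of the backoff schedule. It assumes a simplified scenario in which every retransmission attempted before the link frees up is lost and the timeout doubles after each failure up to 64 times the base RTO; the function names and the example numbers are hypothetical.

```python
from typing import List


def retransmit_times(rto_us: float, attempts: int, max_factor: int = 64) -> List[float]:
    """Times at which successive retransmissions fire under doubling backoff."""
    times = []
    now = 0.0
    factor = 1
    for _ in range(attempts):
        now += rto_us * factor
        times.append(now)
        factor = min(factor * 2, max_factor)  # backoff is capped at 64 * RTO
    return times


def idle_after_link_free(rto_us: float, link_free_us: float, max_factor: int = 64) -> float:
    """Idle time between the link freeing up and the next retransmission.

    Simplifying assumption: every retransmission before link_free_us fails.
    """
    now = 0.0
    factor = 1
    while True:
        now += rto_us * factor
        if now >= link_free_us:
            return now - link_free_us
        factor = min(factor * 2, max_factor)


if __name__ == "__main__":
    print(retransmit_times(20.0, 5))
    print(idle_after_link_free(20.0, 310.0))
```

With a 20µs base RTO, a link that frees up just after the fourth retransmission waits roughly as long again for the fifth, which is the idle gap visible in Figure 12.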
[Figure 13 plot: "Average Goodput VS # Servers" (Block size = 40MB, buffer = 32KB, rtt = 20us). x-axis: Number of Servers.]

Figure 13: In simulation, introducing a randomized
component to the RTO desynchronizes retransmissions
following timeouts and avoids goodput
degradation for a large number of flows.

This analysis shows that decreased goodput/throughput
for a large number of flows can be
attributed to many flows timing out simultaneously,
backing off deterministically, and retransmitting at
precisely the same time. By adding some degree of
randomness to the RTO, the retransmissions can be
desynchronized such that fewer flows experience
repeated timeouts.

We examine both RTOmin and the retransmission
synchronization effect in simulation, measuring the
goodput for three different settings: a 200µs RTOmin,
a 20µs RTOmin, and a 20µs RTOmin with a modified,
randomized timeout value set by:

$$\mathrm{timeout} = \mathrm{RTO} + \bigl(\mathrm{rand}(0.5) \times \mathrm{RTO}\bigr)$$

Figure 13 shows that the 200µs RTOmin scales
poorly as the number of concurrent senders increases:
at 1024 servers, throughput is still an order
of magnitude lower. The 20µs RTOmin shows improved
performance, but eventually suffers beyond
1024 servers due to the successive, simultaneous
timeouts experienced by a majority of flows.

Adding a small random delay performs well regardless
of the number of concurrent senders because
it explicitly desynchronizes the retransmissions of
flows that experience repeated timeouts, and does not
heavily penalize flows that experience a few timeouts.
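The desynchronizing effect is easy to see in a toy model that is far simpler than the simulations reported here. It assumes that all flows time out at the same instant with the same RTO and counts how many retransmissions land in the busiest time slot; the slot width, flow count, and function names are illustrative assumptions.

```python
import random
from collections import Counter
from typing import List


def fire_times(n_flows: int, rto_us: float, randomized: bool,
               rng: random.Random) -> List[float]:
    """Retransmission times for flows that all timed out together at t = 0."""
    if randomized:
        # RTO plus a uniform extra delay of up to half an RTO per flow.
        return [rto_us + rng.uniform(0.0, 0.5) * rto_us for _ in range(n_flows)]
    return [rto_us] * n_flows


def busiest_slot(times_us: List[float], slot_us: float) -> int:
    """Largest number of retransmissions falling into one slot of slot_us."""
    slots = Counter(int(t // slot_us) for t in times_us)
    return max(slots.values())


if __name__ == "__main__":
    rng = random.Random(7)
    slot = 1.2  # roughly one 1500-byte packet time at 10Gbps
    print(busiest_slot(fire_times(1024, 200.0, False, rng), slot))
    print(busiest_slot(fire_times(1024, 200.0, True, rng), slot))
```

Without randomization every retransmission arrives in the same slot and competes for the same buffer space; with it, the burst is spread across the extra half-RTO.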

A caveat of these event-based simulations is that
the timing of events is only as accurate as the simulation
timestep. At the scale of microseconds, there
will exist small timing and scheduling differences in
the real-world that are not captured in simulation.
For example, in the simulation, a packet will be retransmitted
as soon as the retransmission timer fires,
whereas kernel scheduling may delay the actual retransmission
by 10µs or more. Even when offloading
duties to ASICs, slight timing differences will exist
in real-world switches. Hence, the real-world behavior
of incast in 10GE, 20µs RTT networks will likely
deviate from these simulations slightly, though the
general trend should hold.

8  Implications  of  Reduced  RTOmin 
on  the  Wide-area 

Aggressively lowering both the RTO and RTOmin
shows practical benefits for datacenters. In this section,
we investigate if reducing the RTOmin value to
microseconds and using finer granularity timers is
safe for wide area transfers. We find that the impact
of spurious timeouts on long, bulk data flows
is very low (within the margins of error), allowing
RTO to go into the microseconds without impairing
wide-area performance.

8.1  Evaluation 

The major potential effect of a spurious timeout is a
loss of performance: a flow that experiences a timeout
will reduce its slow-start threshold by half, its
window to one and attempt to rediscover link capacity.
Spurious timeouts occur not when the network
path drops packets, but rather when it observes a sudden,
higher delay, so the effect of a shorter RTO on
increased congestion is likely small because a TCP
sender backs off on the amount of data it injects into
the network on a timeout. In this section we analyze
the performance of TCP flows over the wide-area for
bulk data transfers.

Experimental Setup: We deployed two servers
that differ only in their implementation of the RTO
values and granularity, one using the default Linux
2.6.28 kernel with a 200ms RTOmin, and the other
using our modified hrtimer-enabled TCP stack with
a 200µs RTOmin. We downloaded 12 torrent files
consisting of various Linux distributions and began
seeding all content from both machines on the same
popular swarms for three days. Each server uploaded
over 30GB of data, and observed around 70,000
flows (with non-zero throughput) over the course of
three days. We ran tcpdump on each machine to collect
all uploaded traffic packet headers for later analysis.

[Figure 14 plot: x-axis: RTT (ms).]

Figure 14: A comparison of RTT distributions of
flows collected over 3 days on the two configurations
shows that both servers saw a similar distribution of
both short and long-RTT flows.

The TCP RTO value is determined by the estimated
RTT value of each flow. Other factors being
equal, TCP throughput tends to decrease with
increased RTT. To compare RTO and throughput
metrics for the 2 servers we first investigate if they
see similar flows with respect to RTT values. Figure 14
shows the per-flow average RTT distribution
for both hosts over the three day measurement period.
The RTT distributions are nearly identical, suggesting
that each machine saw a similar distribution
of both short and long-RTT flows. The per-packet
RTT distributions for both hosts are also identical.

Figure 15 shows the per-flow throughput distributions
for both hosts, filtering out those flows with
a bandwidth less than 100bps, which are typically
flows sending small control packets. The throughput
distributions for both hosts are also nearly identical:
the host with RTOmin = 200µs did not perform worse
on the whole than the host with RTOmin = 200ms.

We split the throughput distributions based on
whether the flow’s RTT was above or below 200ms.
For flows above 200ms, we use the variance in the
two distributions as a control parameter: any variance
seen above 200ms is a result of measurement
noise, because the RTOmin is no longer a factor. Figure 16
shows that the difference between the distributions
for flows below 200ms is within this measurement
noise.

Figure 15: The two configurations observed an identical
throughput distribution for flows. Only flows
with throughput over 100 bits/s were considered.

Figure 16: The throughput distribution for short and
long RTT flows shows negligible difference across
configurations.

Table 1 lists statistics for the number of spurious
timeouts and normal timeouts observed by both the
servers. These statistics were collected using tcptrace [2]
patched to detect timeout events [1]. The
number of spurious timeouts observed by the two
configurations is high, but comparable. We attribute
the high number of spurious timeouts to the
nature of TCP clients in our experimental setup. The
peers in our BitTorrent swarm observe large variations
in their RTTs (2x-3x) due to the fact that they
are transferring large amounts of data, frequently establishing
and breaking TCP connections to other
peers in the swarm, resulting in variations in buffer
occupancy of the bottleneck link.^ Figure 17 shows
one such flow picked at random, and the per-packet
RTT and estimated RTO value over time. The dip
in the estimated RTO value below the 350ms RTT
band could result in a spurious timeout if there were
a timeout at that instant and the instantaneous RTT
was 350ms.

RTOmin    Normal timeouts    Spurious timeouts    Spurious Fraction
200ms     137414             47094                25.52%
200µs     264726             90381                25.45%

Table 1: Statistics for timeout events across flows
for the 2 servers with different RTOmin values. Both
servers experience a similar % of spurious timeouts.

[Figure 17 plot: x-axis: Time (sec), 0 to 2000. Series: RTT, RTO.]

Figure 17: RTT and RTO estimate over time for a
randomly picked flow over a 2000 second interval.
The RTT and RTO estimates vary significantly in the
wide-area, which may cause spurious timeouts.

This data suggests that reducing the RTOmin to
200µs in practice does not affect the performance of
bulk-data TCP flows on the wide-area.
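The spurious fractions reported in Table 1 are consistent with dividing spurious timeouts by the total of normal and spurious timeouts, rather than by the normal count alone. The short check below re-derives both percentages from the table's counts; the function name is hypothetical.

```python
def spurious_fraction(normal: int, spurious: int) -> float:
    """Fraction of all timeouts (normal plus spurious) that were spurious."""
    return spurious / (normal + spurious)


if __name__ == "__main__":
    # Counts taken from Table 1.
    table = {"200ms": (137414, 47094), "200us": (264726, 90381)}
    for rto_min, (normal, spurious) in table.items():
        print(rto_min, f"{100 * spurious_fraction(normal, spurious):.2f}%")
```

Although the 200µs server saw nearly twice as many timeouts in absolute terms, the share that were spurious is essentially unchanged.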

^An  experiment  run  by  one  of  the  authors  discovered  that 
RTTs  over  a  residential  DSL  line  varied  from  66ms  for  one  TCP 
stream  to  two  seconds  for  four  TCP  streams. 


9  Related  Work 

TCP Improvements: A number of TCP improvements
over the years have improved TCP’s ability to
respond to loss patterns and perform better in particular
environments, many of which are relevant to the
high-performance datacenter environment we study.
NewReno and SACK, for instance, reduce the number
of loss patterns that will cause timeouts; prior
work on the incast problem showed that NewReno,
in particular, improved throughput during moderate
amounts of incast, though not when the problem became
severe [28].

TCP mechanisms such as Limited Transmit [3]
were specifically designed to help TCP recover from
packet loss when window sizes are small, which is exactly
the problem that occurs during incast. This solution
again helps maintain throughput under modest congestion,
but during severe incast, the most common
loss pattern is the loss of the entire window.

Finally, proposed improvements to TCP such as
TCP Vegas [10] and FAST TCP [19] can limit window
growth when RTTs begin to increase, often
combined with more aggressive window growth algorithms
to rapidly fill high bandwidth-delay links.
Unlike the self-interfering oscillatory behavior on
high-BDP links that this prior work seeks to resolve,
incast is triggered by the arrival and rapid ramp-up of
numerous competing flows, and the RTT increases
drastically (or becomes a full window loss) over a
single round-trip. While an RTT-based approach is
interesting to study as an alternative solution
to incast, it is a matter of considerable future
work to adapt existing techniques for this purpose.

Efficient, fine-grained kernel timers. Where
our work depends on hardware support for high-resolution
kernel timers, earlier work on “soft
timers” shows an implementation path for legacy
systems [5]. Soft timers can provide microsecond-resolution
timers for networking without introducing
the overhead of context switches and interrupts. The
hrtimer implementation we do make use of draws
lessons from soft timers, using a hardware interrupt
to trigger all available software interrupts.

Understanding RTOmin. The origin of concern
about the safety and generality of reducing RTOmin
was presented by Allman and Paxson [4], where they
used trace-based analysis to show that there existed
no optimal RTO estimator, and to what degree
the TCP granularity and RTOmin had an impact on
spurious retransmissions. Their analysis showed that
a low or non-existent RTOmin greatly increased the
chance of spurious retransmissions and that tweaking
the RTOmin had no obvious sweet-spot for balancing
fast response with spurious timeouts. They showed
the increased benefit of having a fine measurement
granularity for responding to good timeouts because
of the ability to respond to minor changes in RTT.
Last, they suggested that the impact of bad timeouts
could be mitigated by using the TCP timestamp option,
which later became known as the Eifel algorithm [21].
F-RTO later showed how to detect spurious
timeouts by detecting whether the following acknowledgements
were for segments not retransmitted [32],
and this algorithm is implemented in Linux
TCP today.

Psaras and Tsaoussidis revisit the minimum RTO
for high-speed, last-mile wireless links, noting the
default RTOmin is responsible for worse throughput
on wireless links and short flows [29]. They suggest
a mechanism for dealing with delayed ACKs that attempts
to predict when a packet’s ACK is delayed: a
per-packet RTOmin. We find that while delayed ACK
can affect performance for low RTOmin, the benefits
of a low RTOmin far outweigh the impact of delayed
ACK on performance. Regardless, we provide alternative,
backwards-compatible solutions for dealing
with delayed ACK.


10  Conclusion 

This paper presented a practical, effective, and safe
solution to eliminate TCP throughput degradation
in datacenter environments. Using a combination
of microsecond granularity timeouts, randomized retransmission
timers, and disabling delayed acknowledgements,
the techniques in this paper allowed
high-fan-in datacenter communication to scale to 47
nodes in a real cluster evaluation, and potentially to
thousands of nodes in simulation. Through a wide-area
evaluation, we showed that these modifications
remain safe for use in the wide-area, providing a general
and effective improvement for TCP-based cluster
communication.